Introduction

In this practical you will learn how to carry out a supervised classification of remotely sensed data using R and QGIS. We will use drone-collected aerial imagery of an urban community garden in Melbourne, Australia, to create what is termed a land classification. Land classifications are useful for quantifying land use, detecting land-use change, and deriving landscape metrics such as land-use area, number of patches, patch size, or spatial aggregation and dispersion.

Before we start, make sure you have the right software installed and have downloaded all necessary data to your computer. You will need R and RStudio, which you can download here and here. R is an open-source language and environment for statistical computing and graphics, in which you use coded commands to carry out statistical and other analyses. If you already have R installed on your computer, please make sure you are on R version 4.0 or higher. You can check the currently installed version by typing version into the console and hitting enter. If you are on a lower version, please update. In this practical, I will be using functions from a variety of package extensions and will not always explain what each function does in detail. If you are unsure what is happening, or simply curious, you can type a question mark ? in front of a function name in the R console to get more information. For example, to find out what the base R function mean() does, type the following into the console and hit enter:

?mean

We will also work with QGIS, an open-source GIS software, which you can download here. The current version is 3.20.0, but any version >3.0 will work for the purposes of this practical. It is also possible to use ArcGIS to achieve the same goal, but this practical focuses on the workflow in QGIS, mostly because it is free and open-source.

Next, you need to download the data required for this practical. Download the archive SC_remote_sensing.zip here and extract it. Next, open the folder and double click the file SC_remote-sensing.Rproj:

You are now in an R project, which is already set up to work best with the practical and the code examples that follow. If it is not already open, navigate to the scripts folder and open the file sc_remote-sensing.R, which contains some code you need to run through first.

Load and install packages

We will be working with a few packages for R. Packages are software extensions written by users to enhance the capacity of R. In our case we will use packages for spatial data analysis and machine learning modeling, which would otherwise be either very hard or even impossible in base R. First, install the packages using the following code:

install.packages(c('raster',
                   'sf',
                   'rgdal',
                   'caret', 
                   'randomForest', 
                   'exactextractr', 
                   'viridis', 
                   'cvms', 
                   'broom', 
                   'patchwork', 
                   'tidyverse', 
                   'rsvg', 
                   'ggimage', 
                   'snow'))

If you have used R before, some of these may already be installed. If R prompts you to update them, please do so. Installation will take a while and you can follow the progress in the console. Once you have run the code above, you do not have to run it again. However, each time we open R we have to load the packages required for our analyses or the respective script before doing anything else. Therefore, run:

library(raster)
library(sf)
library(caret)
library(randomForest)
library(exactextractr)
library(viridis)
library(cvms)
library(broom)
library(patchwork)
library(tidyverse)

#setup some options for the raster package

rasterOptions(progress = 'text', timer = T)

Now we are ready to start working!

Background

Today we will be working with a stacked raster image of the Balwyn community garden in Melbourne’s eastern suburbs. The data was collected as part of a study investigating the use of remotely sensed drone imagery, and other data products derived from such platforms, to enhance ground surveys in urban ecosystems such as community gardens. If you would like to learn more, you can find the study, led by Prof. Egerer, here. The garden is about 800 m² in size and comprises a variety of planted beds tended by different stakeholders in the garden.

Balwyn Community garden. Imagery was collected in Summer 2018.


If you already want to have a look at the imagery in QGIS, you can find the file in the folder data/rasters_sep, named balwyn_rgb.tif. It contains three bands, one for each visible-light reflectance band (red, green and blue). Today we will attempt to use these bands to classify the image into different land-use classes using a machine learning algorithm called ‘Random Forests’. Because we control the classes we are trying to predict and provide our own training input for the model, we are carrying out what is termed a supervised classification. That means we teach the model what to look for, so that it can then automatically identify the classes across a larger set of data later on.

Machine learning

Rather than fitting a regression (i.e. a direct relationship between a response variable and a covariate), machine learning of this kind is rule-based modeling. In simple terms, many simple yes/no conditions (splits) are combined to describe the response variable (in our case the land use). The final model contains many of these conditions and is called a decision tree. Let’s have a look at a mock dataset to illustrate this:

This dataset describes under what weather conditions people are likely to have a barbecue (or as the Aussies say, ‘Barbie’). The respective decision tree of this dataset would look like this:

This decision tree is what the model uses to make a prediction: it follows the tree’s branches to derive rules for whether a barbecue is possible under the given weather conditions. There are five rules here that allow the model to make the decision:

  1. IF (Outlook = Overcast) THEN BBQ = Yes

  2. IF (Outlook = Sunny) AND (Windy = FALSE) THEN BBQ = Yes

  3. IF (Outlook = Sunny) AND (Windy = TRUE) THEN BBQ = No

  4. IF (Outlook = Rainy) AND (Humidity = High) THEN BBQ = No

  5. IF (Outlook = Rainy) AND (Humidity = Normal) THEN BBQ = Yes
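The five rules above can be written out as a small base-R function. This is only a toy illustration of how a decision tree encodes rules; the inputs (`outlook`, `windy`, `humidity`) come from the mock dataset, not from the practical's garden data:

```r
# Toy encoding of the five BBQ rules above.
# outlook: 'Overcast', 'Sunny' or 'Rainy'; windy: TRUE/FALSE; humidity: 'High' or 'Normal'
bbq_decision <- function(outlook, windy = FALSE, humidity = 'Normal') {
  if (outlook == 'Overcast') return('Yes')                                  # rule 1
  if (outlook == 'Sunny')    return(ifelse(windy, 'No', 'Yes'))             # rules 2 & 3
  if (outlook == 'Rainy')    return(ifelse(humidity == 'High', 'No', 'Yes')) # rules 4 & 5
  NA_character_
}

bbq_decision('Overcast')                  # "Yes"
bbq_decision('Sunny', windy = TRUE)       # "No"
bbq_decision('Rainy', humidity = 'High')  # "No"
```

A random forest builds many such trees from random subsets of the data and lets them vote, rather than relying on a single set of hand-written rules.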

There is a range of approaches for this type of modeling. Today we will be using Random Forests, a method first described by Leo Breiman in 2001. You can find the respective paper here, in case you want to learn more. Rather than deriving rules for whether we can have a barbecue, today we will derive rules describing common land-use classes in urban gardens, such as crops, trees or paths & soil, from visible-light reflectance and object height. Our final product will be a spatial prediction of land use in the Balwyn Community garden.

Preparing and visualizing our data

To increase our chances of deriving correct predictions of the land use in the urban garden, we will also be using a canopy height model (CHM). This is a raster containing the height above the ground of each image pixel, and therefore of each object in the garden. This product was also derived from the drone imagery through a process called ‘Structure from Motion’, a form of photogrammetry. First of all we want to load our raster data into R, have a look at it visually and then combine it for our modeling.

First load the two raster files:

balwyn_rgb<-stack('data/rasters_sep/balwyn_rgb.tif')
balwyn_chm<-stack('data/rasters_sep/balwyn_chm.tif')

Using plot(), we can already have a look at the data:

plot(balwyn_rgb)

plot(balwyn_chm)

We can see that the first stack contains our three visible-light bands, with reflectance values scaled to 0–255. The second raster is the canopy height model (CHM), where each pixel expresses its height above the ground in meters. As the bands do not have very informative names right now, we want to change them first:

names(balwyn_rgb)<-c('red', 'green', 'blue')
names(balwyn_chm)<-'chm'

plot(balwyn_rgb$red) #use the $ operator to plot single bands from a raster stack

So what exactly is a raster? Technically, it is an array in which x and y describe the location of a pixel and the data are stored in a matrix. For example, in our CHM we have the height value for each location within the garden. We can check this by transforming our raster into a data.frame object:

CHM_sample<-sampleRegular(balwyn_chm, size = 5e5, asRaster = TRUE) %>%
  as.data.frame(xy = T, na.rm = T)%>%rename('CHM' = 3)

#have a peek at the data using head()
head(CHM_sample)
##             x       y      CHM
## 4998 331806.0 5812999 1.620000
## 4999 331806.1 5812999 1.620000
## 5000 331806.1 5812999 1.623390
## 5001 331806.2 5812999 1.705536
## 5002 331806.2 5812999 1.834619
## 5003 331806.3 5812999 1.931432

x and y are geographic coordinates, and the CHM column contains the height in meters at each coordinate. We used the sampleRegular() function from the raster package to extract only 500,000 (5e5) pixels, as the raster is quite high resolution and contains over 5 million pixels (you can check this by calling ncell() on a raster object). We can inspect the data using summary() or by plotting a simple histogram:

summary(CHM_sample$CHM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1418  0.8016  1.1199  2.0476  1.8268 16.0900
hist(CHM_sample$CHM)
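
To make the array-with-coordinates idea concrete, here is a minimal base-R sketch with invented toy values (not the actual CHM), mirroring what the data.frame conversion above does:

```r
# A raster is essentially a matrix of values plus a coordinate grid.
# Toy 2x3 "raster" of heights in meters:
vals <- matrix(c(0.5, 1.2, 2.0,
                 0.8, 1.5, 0.3), nrow = 2, byrow = TRUE)

# Build the x/y coordinate grid and flatten the matrix row by row,
# giving one row per pixel, like as.data.frame(raster, xy = TRUE):
grid <- expand.grid(x = 1:ncol(vals), y = 1:nrow(vals))
df <- data.frame(grid, CHM = as.vector(t(vals)))

head(df)  # each row: pixel location (x, y) and its height value
```

In the real data the x/y columns hold projected UTM coordinates rather than simple column/row indices, but the structure is the same.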

As this is an urban garden that mostly contains either small vegetable crops or bare soil and paths, most of our pixels are between 0 and 3 meters in height. We can visualize the raster more nicely using the ggplot2 package, but since most of our data lies at the lower end of the height scale, we should first transform it:

CHM_sample<-CHM_sample%>%mutate(CHM_fix = case_when(CHM>3 ~ 3, TRUE ~ CHM))

ggplot()+
  geom_raster(data = CHM_sample, aes(x = x, y = y, fill = CHM_fix))+
  scale_fill_viridis(option = 'F', breaks = c(0,1.5,3), labels = c('0', '1.5', '>3'))+
  coord_fixed(expand = F)+
  theme_bw()+
  labs(x = '', y = '', title = 'CHM height', fill = 'Height (m)')
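
The capping step in the case_when() call above can equivalently be sketched with base R's pmin(), which takes the element-wise minimum:

```r
# Cap heights at 3 m for display, as the case_when() above does
CHM <- c(0.4, 1.8, 5.2, 12.0)  # toy height values
CHM_fix <- pmin(CHM, 3)
CHM_fix  # 0.4 1.8 3.0 3.0
```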

We want to combine both raster objects (all four bands) into one stack and then save it, because we will need to work with this data in QGIS next.

#combine

balwyn_all<-stack(balwyn_rgb, balwyn_chm)
balwyn_all #call the variable name to get basic information including each band's range
## class      : RasterStack 
## dimensions : 3098, 1812, 5613576, 4  (nrow, ncol, ncell, nlayers)
## resolution : 0.0148, 0.0147  (x, y)
## extent     : 331799.2, 331826, 5812954, 5813000  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=55 +south +datum=WGS84 +units=m +no_defs 
## names      :        red,      green,       blue,        chm 
## min values : -3.1654527,  1.0000000, -6.9422560,  0.1303391 
## max values :    254.844,    255.000,    255.000,     16.090
#save

writeRaster(balwyn_all, 'outputs/rasters/balwyn_garden.tif', options="COMPRESS=LZW") # the LZW compression method reduces file size drastically

Creating training polygons for supervised classification

To derive the rules described above, we need data describing this garden’s land use to train our model on. We will create so-called ‘training polygons’ in QGIS: vector shapes to which we assign the land-use class they cover. In spatial supervised classification, these are also called regions of interest (ROIs). This is a manual process, but we only have to create a few polygons to get enough data and have the model do the rest of the work for us. We can extract each pixel of our raster stack covered by the training polygons. This data will then be used to derive the decision trees and to classify everything that we did not manually cover with polygons. Today we want to create polygons describing four major land-use classes in the Balwyn garden:

The four land-use classes we want to predict today
Class Identifier (lc) Class
1 Crops
2 Trees
3 Paths & Soil
4 Buildup
  1. Open QGIS and drag & drop the raster file we just saved in outputs/rasters/ (balwyn_garden.tif) into the Layers panel.

  2. Now go to the menu Layer > Create Layer > New GeoPackage Layer

  3. In the window that opens, choose the following settings:

    1. In Database, navigate to the R project folder data/shapes and save the file as balwyn_roi.gpkg

    2. In Geometry type choose polygon

    3. Leave the coordinate reference system as is (EPSG:32755 - WGS 84 / UTM Zone 55s)

    4. In the New Field window, type lc into the Name field, select ‘Whole Number (integer)’ in Type and click ‘Add to Fields List’

    5. Now press OK to create the layer

  4. Select the balwyn_roi layer in the Layers panel and toggle editing in the toolbar, then select ‘Add Polygon Feature’

  5. Now zoom in and move around the garden imagery and manually create polygons for each of the four classes by pointing and clicking. When you have finished creating the first polygon (i.e. closed a shape), right-click to assign it a land class:

  6. In the window that pops up, leave the fid as it is (‘Autogenerate’) and enter the respective class identifier (lc) into the second field. In this case we want to enter a 1, because it is a crop.

  7. Now create one polygon for each of the other classes (one for trees, one for paths & soil and one for buildup). Trees can be found at the northern and southern ends of the garden, and paths should be easy to identify. Buildup is anything man-made, for example a plastic cover over a garden bed:

  8. Before you add any more polygons, right-click on the balwyn_roi layer and select ‘Properties’. A new window opens; select ‘Symbology’. In the first field (‘Column’), select the field lc, then click ‘Classify’ below the currently empty field. This assigns a different color to each of your polygons depending on the class you assigned, which helps you visualize your work and also allows for quality control at the end. You can change the colors to your liking by double-clicking on the colored squares next to the lc numbers.

  9. Now keep creating more polygons. For classes that contain a variety of objects and therefore varied visible-light reflectance (or canopy height), use more polygons; for homogeneous classes you do not need many. Create at least 20 polygons each for crops (lc = 1), soil and paths (lc = 3) and buildup (lc = 4); for trees (lc = 2), around 5 should suffice. In the end, your garden should look nice and colorful like this:

  10. When you’re finished, don’t forget to click the edit button again to stop editing and save the polygons you created to the layer.

Now we can extract and compile our training data!

Extracting and compiling modeling data

What we’ll do now is use the polygons we just created to extract data from the stacked raster containing information about visible-light reflectance and height. The idea behind our modeling is that the four classes will be distinctly different in light reflectance (and therefore color) as well as height. For example, crops are mostly green, while paths are brown. Paths will also be very low above the ground, while crops will be a few centimeters to 1–2 meters tall. Some classes may have similar light reflectance (e.g. crops and trees) but differ significantly in height. We want to use these differences to distinguish the four classes and derive a working model that can automatically predict the classes across the entire garden from our training data.
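
This intuition, that classes separate on colour and height, can be sketched as a toy rule-based classifier in base R. The thresholds below are invented purely for illustration; the random forest will learn its own splits from the training data:

```r
# Toy thresholds (made up for illustration, not learned from data):
# greenish + tall -> tree; greenish + short -> crop;
# not greenish + very low -> bare soil/path; everything else -> buildup
classify_pixel <- function(red, green, blue, chm) {
  greenish <- green > red & green > blue
  if (greenish && chm > 2)    return('tree')
  if (greenish && chm <= 2)   return('crop')
  if (!greenish && chm < 0.2) return('bare')
  'buildup'
}

classify_pixel(red = 80,  green = 140, blue = 50, chm = 4.5)   # "tree"
classify_pixel(red = 80,  green = 140, blue = 50, chm = 1.0)   # "crop"
classify_pixel(red = 150, green = 120, blue = 90, chm = 0.05)  # "bare"
```

The advantage of the random forest is that it derives many such thresholds automatically, from hundreds of trees, instead of us hand-tuning them.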

First we can have a look at what we’ve just done visually in R as well:

#read the data

balwyn_rois<-st_read('data/shapes/balwyn_roi.gpkg') #st_read is a function of the sf package, used to read vector data (shapefiles, GeoPackages) into R
## Reading layer `balwyn_roi' from data source 
##   `/Users/Ben/Documents/Uni/Guest lecture Monika/Urban Gardens SC/data/shapes/balwyn_roi.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 68 features and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 331800.2 ymin: 5812955 xmax: 331822.8 ymax: 5812998
## Projected CRS: WGS 84 / UTM zone 55S
plot(balwyn_all$blue)
plot(balwyn_rois$geometry, 
     add = T, 
     col = balwyn_rois$lc)

In case you were not able to create the training polygon file, there is a backup file in data/shapes/backup. You can copy it to data/shapes and work from there. Let’s see how many polygons we created per class:

table(balwyn_rois$lc)
## 
##  1  2  3  4 
## 22  4 19 23

I created 22 polygons for crops, 4 for trees, 19 for paths & soil and 23 for buildup. Your numbers may differ, as you all created your own polygon layers.

Next we will extract the data. The raster package has its own extract() function, but it can be quite slow, so we will make use of the much faster exact_extract() function from the exactextractr package:

#extract all pixel data from the raster stack balwyn all using the polygons balwyn_rois

roi_data<-exact_extract(balwyn_all, balwyn_rois, fun = NULL)

#We assign an ID to each element of the resulting list to combine it with its class later

for( i in seq_along(roi_data)){
  roi_data[[i]]$ID <- seq_along(roi_data)[i]}

length(roi_data)
## [1] 68

The output is a list of data.frames, containing the extracted pixel values for each of our polygons. In my case there are 68 separate tables. Let’s combine them using dplyr syntax:

#get modeling data in data frame form

roi_table<-dplyr::bind_rows(roi_data)%>%dplyr::select(-5) #removes an unused column
  
roi_table$lc <- as.factor(balwyn_rois$lc[roi_table$ID]) #assigns the respective land class from our polygons

#create a character column describing the land class

roi_table<-roi_table%>%mutate(class = case_when(lc == 1 ~ 'crop',
                                 lc == 2 ~ 'tree',
                                 lc == 3 ~ 'bare',
                                 lc == 4 ~ 'buildup'))

head(roi_table)
##         red    green     blue      chm ID lc class
## 1  98.58922 138.9329 64.49767 1.102906  1  1  crop
## 2  85.12378 128.1352 53.94777 1.127640  1  1  crop
## 3 123.86208 166.0410 86.13258 1.078827  1  1  crop
## 4 101.07151 142.5510 65.90388 1.103285  1  1  crop
## 5  83.10924 126.3813 49.95472 1.127604  1  1  crop
## 6  69.86186 114.3416 39.59161 1.147284  1  1  crop

As you can see, we have extracted every pixel value covered by our polygons from the four available bands (red, green, blue and height (chm)). We now have a dataset of roughly 400,000 rows, with combinations of light reflectance and height describing the land classes we assigned manually.
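
The list-combining step above can also be done with base R's do.call(rbind, ...), which bind_rows() generalizes. A minimal sketch with toy stand-in data:

```r
# Toy stand-in for roi_data: a list of per-polygon pixel tables
roi_list <- list(
  data.frame(red = c(10, 20), chm = c(0.1, 0.2), ID = 1),
  data.frame(red = c(30),     chm = c(1.5),      ID = 2)
)

# Stack the per-polygon tables into one data.frame,
# the same effect as dplyr::bind_rows() when columns match
combined <- do.call(rbind, roi_list)
nrow(combined)  # 3
```

bind_rows() is more forgiving (it fills missing columns with NA and is faster on long lists), which is why it is used in the practical.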

Creating model training and testing datasets

In statistical modeling it is important to test model performance, to ensure that the model does what it is supposed to do and is therefore useful. That means we want to know how our model performs on new data, so we can be confident about using it in a predictive manner. We will train our model to find the right rule sets on a subset of our data and then test how well it does on another subset that was not part of the model building. This is called independent model validation.

First of all, we have to check how much data we now have for each of our land-use classes:

balwyn_sample<-roi_table%>%mutate(lc = as.factor(lc))%>%as_tibble() #creates a separate dataset and makes the lc column categorical

summary(balwyn_sample$lc)
##      1      2      3      4 
## 135602  66672 126652 105398

Our data is not evenly distributed because we created a different number of training polygons for each class. Also, this is probably far more data than we need to build a model on, and it may cause our system to crash; we do not require 400,000 rows. Let’s reduce and balance the data:

balwyn_sample_subset<-na.omit(balwyn_sample)%>%group_by(lc)%>%sample_n(10000)

summary(as.factor(balwyn_sample_subset$lc))
##     1     2     3     4 
## 10000 10000 10000 10000
#leave only useful columns for modeling

dataset_modeling<-balwyn_sample_subset%>%ungroup()%>%dplyr::select(1:4,6)
head(dataset_modeling)
## # A tibble: 6 x 5
##     red green  blue   chm lc   
##   <dbl> <dbl> <dbl> <dbl> <fct>
## 1  50.0  99.4  40.4 1.28  1    
## 2 198.  227.  161.  0.869 1    
## 3 132.  166.  137.  1.30  1    
## 4  17.8  65.5  12.0 1.04  1    
## 5 106.  144.   82.4 1.25  1    
## 6  75.5 124.   77.6 1.09  1

Now we have 10,000 rows of data per land-use class and have removed unnecessary columns. Let’s split the data into training and testing sets: we will build our model on 80% of the data and leave 20% for independent validation to test performance. For this we can use the caret package and its function createDataPartition():

split <- createDataPartition(dataset_modeling$lc, times = 1, p = 0.8, list = FALSE)
train_subset  <- dataset_modeling[split,]
test_subset  <- dataset_modeling[-split,]

table(train_subset$lc)
## 
##    1    2    3    4 
## 8000 8000 8000 8000
table(test_subset$lc)
## 
##    1    2    3    4 
## 2000 2000 2000 2000

After splitting the data, our training dataset has 8000 rows per land-use class and our testing dataset has 2000.
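
If you are curious what createDataPartition() does under the hood, an 80/20 stratified split can be sketched in base R: sample 80% of the row indices within each class, so both subsets stay balanced. Toy data below; caret's function additionally handles numeric outcomes and repeated splits:

```r
set.seed(42)
# Toy data: 100 rows per class, mimicking a balanced modeling dataset
d <- data.frame(lc = factor(rep(1:4, each = 100)), x = rnorm(400))

# For each class, sample 80% of its row indices
idx <- unlist(lapply(split(seq_len(nrow(d)), d$lc),
                     function(i) sample(i, size = floor(0.8 * length(i)))))

train <- d[idx, ]
test  <- d[-idx, ]

table(train$lc)  # 80 per class
table(test$lc)   # 20 per class
```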

Build the model, evaluate predictors and test model performance

Now we are ready to build a random forest model. Because our dataset contains only the response variable (lc) and the predictors (red, green, blue and chm), the syntax is quite simple. We build the model on the training dataset only. Depending on your computer’s specs, this may run for a few seconds up to a minute or so.

model<-randomForest(lc ~ ., data=train_subset, importance=TRUE)

model
## 
## Call:
##  randomForest(formula = lc ~ ., data = train_subset, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 1.38%
## Confusion matrix:
##      1    2    3    4 class.error
## 1 7971    0    3   26    0.003625
## 2    0 8000    0    0    0.000000
## 3    6    1 7914   79    0.010750
## 4   48    0  280 7672    0.041000

The formula lc ~ . means we want to predict lc from all other columns in the dataset. Calling the model object (model) already gives us some information. An out-of-bag (OOB) error estimate of ~1% is excellent; we want this number to be as close to zero as possible, and it is mainly used to compare different models. It may be possible to reduce it by changing the number of trees or by adding or removing predictors. The summary also tells us that the model grew 500 decision trees (the default in randomForest()).

We can further investigate our model using varImpPlot() to check how our four predictor variables performed:

varImpPlot(model,type = 1)

This graph tells us which input variable (predictor) was most important in predicting the land-use classes. We can see that height above the ground was by far the most important variable: if we removed it, model accuracy would decrease drastically (MeanDecreaseAccuracy). MeanDecreaseAccuracy measures the extent to which a variable improves the accuracy of the decision trees in predicting our land classes; higher values mean the variable contributes more to correct classification. We can also see that green reflectance is quite important, while, compared to chm and the green band, blue and red seem negligible in describing our land-use classes.

Now we have to perform our independent validation to see how well our model actually does. Can these four predictors adequately predict land-use classes from reflectance and height data that were not used in training the model? Let’s find out! We simply use the predict() function and compare manually assigned and predicted land-use classes in a so-called confusion matrix:

test_subset$pred<-predict(model, test_subset) #creates a new column in the test subset with the prediction of our model

confMatrix<- confusionMatrix(as.factor(test_subset$lc), test_subset$pred) #compares prediction and actual value
confMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1 1988    0    1   11
##          2    0 2000    0    0
##          3    2    0 1979   19
##          4   11    0   92 1897
## 
## Overall Statistics
##                                           
##                Accuracy : 0.983           
##                  95% CI : (0.9799, 0.9857)
##     No Information Rate : 0.259           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9773          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9935     1.00   0.9551   0.9844
## Specificity            0.9980     1.00   0.9965   0.9830
## Pos Pred Value         0.9940     1.00   0.9895   0.9485
## Neg Pred Value         0.9978     1.00   0.9845   0.9950
## Prevalence             0.2501     0.25   0.2590   0.2409
## Detection Rate         0.2485     0.25   0.2474   0.2371
## Detection Prevalence   0.2500     0.25   0.2500   0.2500
## Balanced Accuracy      0.9958     1.00   0.9758   0.9837

We get quite a lot of information here. First, check the Accuracy under ‘Overall Statistics’. This should be as close to 100% as possible; it summarizes how well your model predicted the land-use classes. Let’s check our model performance class by class:

confMatrix$table
##           Reference
## Prediction    1    2    3    4
##          1 1988    0    1   11
##          2    0 2000    0    0
##          3    2    0 1979   19
##          4   11    0   92 1897

This matrix is the most important bit of information. Remember that our test data had 2000 rows per land-use class. The data you assigned manually while creating the polygons is the ‘Reference’ on top, and what our model made of it is the ‘Prediction’ on the left. The diagonal of this matrix shows where the prediction matched the reference data, so we want these numbers to be high. For example, in 1988 cases the reference and prediction of class 1 (crops) matched. That is very good and means our model is doing well. Looking further down in the matrix, we can also see that in 11 cases a crop (reference = 1) was confused with buildup (prediction = 4). A case where reference and prediction match is called a true positive. Expressed over all predictions made for that class, this gives the true positive rate, also termed sensitivity:

confMatrix$byClass
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 1   0.9935032   0.9979997         0.9940      0.9978333    0.9940
## Class: 2   1.0000000   1.0000000         1.0000      1.0000000    1.0000
## Class: 3   0.9551158   0.9964575         0.9895      0.9845000    0.9895
## Class: 4   0.9844318   0.9830397         0.9485      0.9950000    0.9485
##             Recall        F1 Prevalence Detection Rate Detection Prevalence
## Class: 1 0.9935032 0.9937516   0.250125       0.248500                 0.25
## Class: 2 1.0000000 1.0000000   0.250000       0.250000                 0.25
## Class: 3 0.9551158 0.9720039   0.259000       0.247375                 0.25
## Class: 4 0.9844318 0.9661319   0.240875       0.237125                 0.25
##          Balanced Accuracy
## Class: 1         0.9957515
## Class: 2         1.0000000
## Class: 3         0.9757867
## Class: 4         0.9837357

The higher the sensitivity the better. Its counterpart is the true negative rate, or specificity. This describes a correct negative (absence) prediction, i.e. how well the model correctly predicted when a pixel was not the class in question. We can see that both measures are close to 1, and in one case even reach 1 (or 100%), for each of our four classes, which is great. Everything off the diagonal in the matrix above is a false positive or false negative: pixels that were assigned a wrong class by the model. Since there are very few, we can be confident that our model performs well overall.
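
These measures can also be computed by hand from the confusion matrix. A base-R sketch using the numbers printed above (rows = prediction, columns = reference):

```r
# Confusion matrix from the output above
m <- matrix(c(1988,    0,    1,   11,
                 0, 2000,    0,    0,
                 2,    0, 1979,   19,
                11,    0,   92, 1897),
            nrow = 4, byrow = TRUE)

# Overall accuracy: correct predictions (the diagonal) over all pixels
accuracy <- sum(diag(m)) / sum(m)
round(accuracy, 3)  # 0.983

# Sensitivity (true positive rate) for class 1:
# true positives divided by all reference pixels of that class (column sum)
sens_1 <- m[1, 1] / sum(m[, 1])
round(sens_1, 4)  # 0.9935
```

Both values match the Accuracy and class 1 Sensitivity reported by confusionMatrix() above.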

A nice illustration of sensitivity and specificity can be found on the respective Wikipedia article:

Source: wikimedia.org, licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

Source: wikimedia.org, licensed under the Creative Commons Attribution-Share Alike 4.0 International license.

While there is a whole lot more information on model performance here, we can ignore the other measures for now. Rest assured that the model is doing well. Since the process of splitting the data and modeling is stochastic, each of you will have slightly different results.

Deriving a spatial prediction

Now that we know our model is doing well, we can finally predict land use across the entire Balwyn garden, which is what we set out to do in the first place. For this we again use the predict() function. Since spatial predictions can be quite computationally intensive, we will run the function in parallel on multiple CPU cores at once to speed things up. Wait until the progress bar in the console has finished; this may take up to 5 minutes, depending on your computer.

balwyn_pred<-balwyn_all #duplicate the raster to not lose the original

beginCluster()
pred<-clusterR(balwyn_pred, raster::predict, args = list(model = model, inf.rm = T), progress = 'text')
endCluster()

In case the code above does not work for you, try the following. It may run for quite a while longer though, so wait until the little red stop sign in the top-right corner of your console has disappeared.

pred<-predict(balwyn_pred, model)

We can now look at the spatial prediction in base R:

plot(pred)

The result is a raster in which each pixel contains one of four values (1–4), each representing one of our land-use classes. But let’s make it look a bit nicer and use ggplot() as we did above with the CHM raster:

names_classes<-c('Crops', 'Trees', 'Soil & Paths', 'Buildup') #assign class names

colors_lc<- c('green', 'darkgreen', 'red', 'black') #choose some colors (feel free to change these!)

pred_df_sample<-sampleRegular(pred, size = 5e5, asRaster = TRUE) %>%
  as.data.frame(xy = T, na.rm = T)%>%rename('LC' = 3) #subsample as before

ggplot()+
  geom_raster(data = pred_df_sample, aes(x = x, y = y, fill = as.factor(LC)))+
  scale_fill_manual(values = colors_lc, labels = names_classes)+
  coord_fixed(expand = F)+
  theme_bw()+
  labs(x = '', y = '', title = 'Balwyn Garden Supervised Classification', fill = 'Class')
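
The mapping from numeric class IDs to readable labels (which scale_fill_manual() handles in the plot above) is essentially a factor with labels; a base-R sketch with toy predicted IDs:

```r
# Toy predicted class IDs, as they would come out of the prediction raster
lc_ids <- c(1, 3, 2, 4, 1)

# levels are the class identifiers, labels the human-readable class names
lc_factor <- factor(lc_ids, levels = 1:4,
                    labels = c('Crops', 'Trees', 'Soil & Paths', 'Buildup'))

as.character(lc_factor)  # "Crops" "Soil & Paths" "Trees" "Buildup" "Crops"
```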

Finally, we can save the spatial prediction. You can then have a look at it in QGIS as well and see how it performed by comparing it with the raster we used to assign the classes.

writeRaster(pred, 'outputs/rasters/balwyn_prediction.tif', options="COMPRESS=LZW")

Evaluate the accuracy of the spatial prediction

We tested our overall model performance using the training and testing datasets, but now we have much more data to see how the model actually does across the entire garden. Using our training polygons as reference data, we can assess the spatial prediction’s performance in a similar manner as above.

First we extract each predicted pixel from the new raster, then combine this extraction with the actual values from our training polygons. Finally, we combine these into a data.frame to run another confusion matrix analysis.

test <- exact_extract(pred, balwyn_rois, fun = NULL) 

for( i in seq_along(test)){
  
  test[[i]]$ID <- seq_along(test)[i]
  
}

test_table<-dplyr::bind_rows(test)%>%dplyr::select(-2)

test_table$lc <- as.factor(balwyn_rois$lc[test_table$ID])


testProbs <- data.frame(
  obs = as.factor(test_table$lc),
  pred = as.factor(test_table$value))

confMatrix_test <- confusionMatrix(testProbs$obs, testProbs$pred)
confMatrix_test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      1      2      3      4
##          1 134950      0     59    569
##          2      0  66670      0      0
##          3    160     16 125199   1277
##          4    548      0   3382 101460
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9862          
##                  95% CI : (0.9858, 0.9865)
##     No Information Rate : 0.3124          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9812          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9948   0.9998   0.9733   0.9821
## Specificity            0.9979   1.0000   0.9952   0.9881
## Pos Pred Value         0.9954   1.0000   0.9885   0.9627
## Neg Pred Value         0.9976   1.0000   0.9888   0.9944
## Prevalence             0.3124   0.1536   0.2962   0.2379
## Detection Rate         0.3107   0.1535   0.2883   0.2336
## Detection Prevalence   0.3122   0.1535   0.2916   0.2427
## Balanced Accuracy      0.9963   0.9999   0.9842   0.9851

We can see that we still have some false positives and negatives here, but we are also testing on a much larger set of pixels now. Overall, our Accuracy, Sensitivity and Specificity have not changed much, which is another indicator that we can trust our model prediction. The main confusion seems to be between classes 3 and 4:

confMatrix_test$table
##           Reference
## Prediction      1      2      3      4
##          1 134950      0     59    569
##          2      0  66670      0      0
##          3    160     16 125199   1277
##          4    548      0   3382 101460

Between classes 3 and 4 alone, roughly 4,600 pixels (1,277 + 3,382) are misclassified within the area of our training polygons. This may be because these two classes share similar reflectance patterns and probably similar heights above ground as well. To reduce these misclassifications, we could either add more training polygons or add additional predictor variables. The tree class (class 2), on the other hand, showed almost no confusion with other classes. Looking at the initial aerial image of the Balwyn garden and the CHM raster, we can see that the few trees here are very tall. Knowing that the chm variable was the most important predictor, we can now understand why this class was predicted so confidently.
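These numbers can be pulled straight out of the confusion matrix object, and we can also double-check the variable importance. A small sketch, assuming `confMatrix_test` and `model` exist as created above and that `model` was fitted with caret (whose `varImp()` function is used here):

```r
cm <- confMatrix_test$table            # the 4 x 4 confusion matrix

# Pixels confused between classes 3 and 4, in both directions:
cm["3", "4"] + cm["4", "3"]            # 1277 + 3382 = 4659

# Total misclassified pixels (all off-diagonal cells):
sum(cm) - sum(diag(cm))

# Variable importance of the fitted model:
caret::varImp(model)
```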

Extract land-use class area and compile a final graph of our analysis

As a final step, we can now start working with our spatial prediction and extract some data from it. There are many spatial metrics that could be derived from the new raster. To keep things simple for the purpose of this practical, we will calculate the most basic one: land-use class area. We want to know how much space each of our four classes occupies within the garden area. This will allow us to make some assumptions about space use and spatial preferences of the garden users. How well did they use the small space they were given? How much area is designated for crops, and what fraction of the area is needed for paths or is currently unused? Let’s find out!

First, we have to transform the raster into a data frame. We will essentially count pixels here, so we assign each row a value of 1 that we can then sum per class to calculate the area:

pred_df<-pred%>%as.data.frame(xy = T, na.rm = T)%>%
  select(1,2, class = 3)

pred_df$ID<-1

res<-res(pred)[1] #this is the resolution of the raster (pixel size), used to calculate the area in m2

garden_area<-pred_df%>%group_by(class)%>%
  summarise(area = sum(ID), aream2 = (area*res^2))%>%
  mutate(sumA = sum(area), per = 100*area/sumA)

garden_area
## # A tibble: 4 x 5
##   class    area aream2    sumA   per
##   <dbl>   <dbl>  <dbl>   <dbl> <dbl>
## 1     1 1814928  398.  3756152 48.3 
## 2     2  254347   55.7 3756152  6.77
## 3     3 1095451  240.  3756152 29.2 
## 4     4  591426  130.  3756152 15.7

The last column is the fraction (in percent) of the area each class (first column) occupies. We can see that around 48%, so almost half, of the garden is used for crops. Almost a third (~29%), however, is used for paths or is currently not stocked. Maybe there is some potential to increase the crop area here?
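Beyond class area, the landscape metrics mentioned in the introduction (number of patches, patch size, aggregation) can be derived from the same raster. A brief sketch using the landscapemetrics package, which is not used elsewhere in this practical and would need to be installed first; it expects a categorical raster in a projected CRS:

```r
library(landscapemetrics)

check_landscape(pred)    # sanity check: projected CRS, categorical values
lsm_c_np(pred)           # number of patches per class
lsm_c_area_mn(pred)      # mean patch area per class (in hectares)
```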

Let’s visualize this to create a report for the garden users. We can plot the map and a bar graph of the table above together using the patchwork package:

b<-ggplot(garden_area, aes(x = reorder(class, -per), y = per, fill = as.factor(class)))+
  geom_bar(color = 'black', stat = 'identity')+
  scale_fill_manual(values = colors_lc, labels = names_classes)+
  scale_x_discrete(labels=c("1" = "Crops", "2" = "Trees",
                            "3" = "Soil & Paths", "4" = 'Buildup'))+
  labs(x = '', y = 'Garden area [%]', fill = 'Class')+
  theme_bw()
  
#get the code from above for the map, but remove the legend:

a<-ggplot()+
  geom_raster(data = pred_df_sample, aes(x = x, y = y, fill = as.factor(LC)), show.legend = F)+
  scale_fill_manual(values = colors_lc, labels = names_classes)+
  coord_fixed(expand = F)+
  theme_bw()+
  labs(x = '', y = '', fill = 'Class')

#combine:
  
a + b + plot_annotation(tag_levels = 'A', title='Supervised classification of a Melbourne Community Garden',
                        subtitle = 'A = Spatial prediction, B = Class areas',
                        caption = 'Guest lecture by Benjamin Wagner -  Supervised classification on remotely sensed data - July 2021')

Looks nice :) We can save this figure as a report in PDF format as well:

ggsave('report.pdf',path = 'outputs/figures/', width = 25, height = 20, units = 'cm', dpi = 600)

Closing remarks

So there we go. Using a relatively simple aerial image of an urban garden, along with height data derived from the same data collection, you were able to create a detailed spatial prediction of urban garden land use via supervised classification. Congratulations on becoming a spatial data analyst! Of course, there are a few further steps that could follow on from this. You could try to improve the model performance, or create further classes to get an even more accurate representation of space use in this and potentially other gardens with similar land-use classes in the city. Nevertheless, you got a glimpse into the workings of land classifications from spatial data and learned about the respective functions and functionalities in R and QGIS.

To keep the focus on the modeling process, I was not able to explain each function we used here in detail. If this was your first time using R and you want a more basic introduction to ease into it, or you simply want to learn more, please have a look at the tutorials I have compiled with some colleagues of mine at https://bennywag.github.io/forest-ecology-R-book/index.html. If you have any questions regarding this material, feel free to write me an email at benjamin.wagner [at] unimelb.edu.au.